
Conversation

@farzadab
Contributor

As reported in #2559, reading from HF Hub datasets can sometimes result in 500 Internal Server Errors. Even though these are rare, they can cause problems on large training runs.

The source of this error has not yet been identified, but a simple fix is to add the missing 500 error code to the retry list.
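For illustration, the idea of retrying on specific status codes with exponential backoff can be sketched as follows. This is not the actual `huggingface_hub` implementation; the helper name and parameters are hypothetical, chosen to mirror the behavior the PR describes (retrying on both 503 and, after this change, 500).

```python
import time

def call_with_backoff(request_fn, retry_on_status_codes=(500, 503),
                      max_retries=5, base_wait_time=1.0, max_wait_time=8.0):
    """Call request_fn() and return its HTTP status code, retrying on the
    given status codes with a doubling wait (capped at max_wait_time)."""
    wait = base_wait_time
    for attempt in range(max_retries + 1):
        status = request_fn()
        if status not in retry_on_status_codes or attempt == max_retries:
            return status
        time.sleep(wait)
        wait = min(wait * 2, max_wait_time)

# Example: a flaky server that returns 500 twice, then succeeds.
responses = iter([500, 500, 200])
status = call_with_backoff(lambda: next(responses), base_wait_time=0.01)
# status is 200: the transient 500s were retried away.
```

Without 500 in `retry_on_status_codes`, the first response would be returned immediately and the caller would fail, which is exactly the failure mode reported in #2559.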

@Wauplin
Contributor

Wauplin commented Sep 26, 2024

Hi @farzadab, thanks for the PR. http_backoff is called 3 times in hf_file_system.py to download files in different ways. Could you update all 3 of them? Thanks!

@farzadab
Contributor Author

My bad. Done.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@Wauplin Wauplin left a comment


Thanks! Looking good now :)

For the record, we've discussed this change internally (private Slack). We had some concerns that retrying on HTTP 500 without knowing the root cause is clunky (it may or may not be a transient error), but since:

  • we retry only up to 5 times (so ~1min)
  • and only when using hf_file_system so basically when streaming a dataset from the Hub

then it's fine to do it. The cost of breaking a training process because of a transient HTTP 500 is higher than the cost of breaking a download process when instantiating a model once.
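The "~1 min" upper bound above follows from the doubling backoff: with illustrative parameters (a 2 s base wait, 5 retries, no cap; the real defaults may differ), the worst-case total wait is:

```python
# Hypothetical parameters: 2 s base wait, doubling each retry, 5 retries.
waits = [2 * 2**i for i in range(5)]  # [2, 4, 8, 16, 32] seconds
total = sum(waits)                    # 62 seconds, i.e. roughly a minute
```

So even if the 500 is not transient, a streaming read fails after about a minute rather than hanging indefinitely.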

All of this to say, it's good to merge! 🤗

@Wauplin Wauplin merged commit 476fa0b into huggingface:main Sep 27, 2024
16 checks passed
@farzadab farzadab deleted the patch-1 branch September 30, 2024 16:48

3 participants